Import

Libraries

GISS Model Experiments

Data Catalogue

We read the data catalogue provided at https://pangeo.io/cloud.html:
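A minimal sketch of how such a catalogue can be opened with intake-esm (the JSON URL below is the Pangeo cloud CMIP6 catalogue and is an assumption; check the page above for the current location):

```python
import intake

# Pangeo's cloud-hosted CMIP6 catalogue (URL is an assumption)
cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# The underlying table of available datasets is a pandas DataFrame
print(cat.df.head())
```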

Check that this dataframe has loaded correctly:

This tells us the number of different model categories we can choose from.
We now choose a model from this list with the following specification:

This query yields an intake_esm.core.esm_datastore object, which we can use to retrieve the dataset we searched for.
Before we do this, let's learn a bit more about the model ensembles we're importing:

A Look into Ensemble Metadata

Notice: The number of historical simulations for each variable is the same, which is good news!

Notice: The number of AMIP simulations for each variable is the same, which is good news!

Downloading the Data

We check what we have downloaded:

Analysing the Model

Checking the Data Structure

The above are already xarray.core.dataset.Dataset objects, meaning they can be used directly for anything one would want to do with the Xarray package (which is well suited to gridded meteorological data).

We verify that this has lat x lon = 143 x 144, 21 members in the ensemble, and spans 1958-2017.
Moreover, the variables tas, pr and psl are float32 values stored in an array with dimensions (ensemble, time, lat, lon).

We verify that this has lat x lon = 143 x 144, 32 members in the ensemble, and spans 1850-2014.
Moreover, the variables tas, pr and psl are float32 values stored in an array with dimensions (ensemble, time, lat, lon).

This gives a nice geometric understanding of the tas array:

Match Time Periods

We must match the time periods covered by the two experiment ensembles, which amounts to:

We can see this successfully truncates the starting year to 1958.
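The truncation can be sketched with plain NumPy (the real code would use xarray's time selection; the year arrays here are toy stand-ins matching the spans noted above):

```python
import numpy as np

# Toy stand-ins: "historical" spans 1850-2014, "AMIP" spans 1958-2017
hist_years = np.arange(1850, 2015)
amip_years = np.arange(1958, 2018)

# The common period is the overlap of the two spans
start = max(hist_years.min(), amip_years.min())  # 1958
end = min(hist_years.max(), amip_years.max())    # 2014

hist_trunc = hist_years[(hist_years >= start) & (hist_years <= end)]
amip_trunc = amip_years[(amip_years >= start) & (amip_years <= end)]
print(hist_trunc.min(), amip_trunc.max())  # 1958 2014
```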

Now repeat this code for the AMIP dataset:

Restrict to Australia

Because each model uses a different grid, we have to 'guess' where Australia should be and verify our bounding box on a projection plot:
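The restriction can be sketched as a latitude/longitude box mask (the bounding-box values below are a rough guess for Australia, to be checked against the map; the grid is a toy one):

```python
import numpy as np

# Toy global grid; the model's actual lat/lon values differ
lat = np.linspace(-90, 90, 90)
lon = np.linspace(0, 357.5, 144)

# Guessed bounding box for Australia (an assumption, verified on the plot)
lat_mask = (lat >= -45) & (lat <= -10)
lon_mask = (lon >= 110) & (lon <= 155)

# Restrict a (lat, lon) field to the box
field = np.random.default_rng(0).random((lat.size, lon.size))
aus = field[np.ix_(lat_mask, lon_mask)]
print(aus.shape)
```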

Brief Analysis

We can see that these datasets are restricted to the geographical location of Australia.

Visualising the Data

We create a plot of the first few members of the historical ensemble in the first month:

Global

We plot some ensemble members:

We now create some Robinson plots for the first ensemble member of each variable:

Australia

...and also one focused on Australia:

Creating Probability Distributions

We will use the following algorithm to generate a probability distribution for the tas variable in each experiment:
(1): Match the time-span that each experiment covers.
(2): Across the first ten years and the last ten years of the simulation, calculate the global mean tas for each member of the ensemble.
(3): Calculate the difference in this mean.
(4): Normalise the data and plot as a histogram.
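Steps (2) and (3) can be sketched as follows on a toy ensemble (plain NumPy, unweighted means for simplicity; a proper global mean would be area-weighted by cos(lat)):

```python
import numpy as np

def mean_change(da, n_years=10, months_per_year=12):
    """Mean over the last n years minus mean over the first n years,
    averaged over time, lat and lon, for each ensemble member.
    `da` has dimensions (ensemble, time, lat, lon)."""
    w = n_years * months_per_year
    first = da[:, :w].mean(axis=(1, 2, 3))
    last = da[:, -w:].mean(axis=(1, 2, 3))
    return last - first

# Toy ensemble: 5 members, 30 years of monthly data on a small grid
rng = np.random.default_rng(0)
tas = rng.normal(288.0, 1.0, size=(5, 360, 10, 12))
diff = mean_change(tas)
print(diff.shape)  # one mean change per ensemble member
```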

General Function

We create a function that calculates the mean difference in tas (or any other variable, for that matter) given the starting time, final time, number of ensemble members, and the experiment type.
As this code normally takes a long time to run, we use parallelisation to speed it up. We also save the output array as a text file that we can load back in later:
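A minimal sketch of the pattern (the notebook may well use multiprocessing or dask instead; threads and the toy data here are just to illustrate the parallel map plus save/load round trip):

```python
import os
import tempfile
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def member_change(member):
    """Mean of the last ten years minus the first ten years for one
    ensemble member (monthly data, so a window of 120 time steps)."""
    w = 120
    return tas[member, -w:].mean() - tas[member, :w].mean()

rng = np.random.default_rng(1)
tas = rng.normal(288.0, 1.0, size=(4, 240, 8, 8))  # toy (ensemble, time, lat, lon)

# Map the per-member computation across workers in parallel
with ThreadPoolExecutor() as ex:
    changes = np.fromiter(ex.map(member_change, range(tas.shape[0])), dtype=float)

# Save the result so the slow computation need not be repeated
path = os.path.join(tempfile.gettempdir(), "tas_changes.txt")
np.savetxt(path, changes)
loaded = np.loadtxt(path)
```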

Historical Ensemble

Tas

Run the following code if you have already run the function above:

Now we have an array of the mean temperature change in each ensemble member:

Pr

We repeat this process above for each of the other two variables pr and psl:

We notice that these values are particularly small due to the units of $kg\,m^{-2}\,s^{-1}$, which will cause numerical problems later. So, we scale them up to $g\,m^{-2}\,d^{-1}$, which equates to multiplying by a factor of: $$ 1000 \times (24 \times 60^2) = 8.64 \times 10^{7} $$
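A quick sanity check on the conversion factor (toy values, not model output):

```python
import numpy as np

# kg m^-2 s^-1 -> g m^-2 d^-1: x1000 (kg -> g) and x86400 (per second -> per day)
SCALE = 1000 * 24 * 60**2   # = 86_400_000 = 8.64e7

pr = np.array([2.5e-5, 3.1e-5])   # toy precipitation values in kg m^-2 s^-1
pr_scaled = pr * SCALE            # now in g m^-2 d^-1
```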

Psl

AMIP Ensemble

We repeat this for the AMIP ensemble.

Tas

Run the following code if you have already run the function above:

Now we have an array of the mean temperature change in each ensemble member:

Pr

As with the historical ensemble, these values are particularly small due to the units of $kg\,m^{-2}\,s^{-1}$, so we scale them up to $g\,m^{-2}\,d^{-1}$ by the same factor of $8.64 \times 10^{7}$.

Psl

Calculating Distances in 3D

Creating Joint Distribution

We re-apply the process above, but for all three variables at once. Hence, for each experiment, we need to build a three-variable array so that a distance can be calculated between the two:

We see that these both have shape (number of AMIP ensemble members, number of variables).
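The stacking step can be sketched as (per-member mean changes are toy values; the member count of 21 is just illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_members = 21  # e.g. the number of AMIP ensemble members

# Mean change per member for each variable (toy values)
tas_diff = rng.normal(0.5, 0.1, n_members)
pr_diff = rng.normal(10.0, 3.0, n_members)
psl_diff = rng.normal(-5.0, 2.0, n_members)

# One row per member, one column per variable
joint = np.column_stack([tas_diff, pr_diff, psl_diff])
print(joint.shape)  # (21, 3)
```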

Visualisation of the Joint Outputs

Let's see these outputs in a 3D projection plot:

Fitting a 3D Multivariate Gaussian to the Distributions

This will be necessary in order to calculate the KL divergence (and JS distance) in a sensible way: unlike the Wasserstein and energy distances, KL divergence requires a probability density for each measure, evaluated on a shared support (this is one of the limitations of KL divergence!).

Mean and Covariance

Analysing the Gaussian

This gives us a 1D array of probabilities which can be used for KL divergence calculation:
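A sketch of both steps, fitting the mean and covariance and then evaluating the density at the sample points, in plain NumPy (scipy.stats.multivariate_normal would do the same; the joint array is toy data):

```python
import numpy as np

def gaussian_pdf(points, mean, cov):
    """Multivariate normal density evaluated at each row of `points`."""
    d = points.shape[1]
    diff = points - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    # Quadratic form diff^T cov^-1 diff for every row at once
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    return norm * np.exp(expo)

rng = np.random.default_rng(3)
joint = rng.normal(size=(32, 3))        # toy (members, variables) array
mean = joint.mean(axis=0)
cov = np.cov(joint, rowvar=False)

probs = gaussian_pdf(joint, mean, cov)  # 1D array of density values
print(probs.shape)
```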

There's no nice way of visualising the actual Gaussian pdf, as it would be in 4D. The only 'sensible' option would be 3D contour plots, but the 3D contours of a Gaussian are ellipsoids (the 3D analogue of ellipses), which would be complicated to code and not very helpful visually.

We repeat this process for the historical outputs:

This gives us a 1D array of probabilities which can be used for KL divergence calculation:

KL Divergence

As the probability arrays have different lengths, we randomly select a matching number of elements from the historical probability array, calculate the KL divergence, then repeat many times in order to calculate an average KL score.
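A sketch of the subsample-and-average loop (the probability arrays here are toy values; each subsample is normalised to sum to one before computing KL):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (arrays summing to 1)."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(4)
amip_probs = rng.random(21)   # toy density values for AMIP (the reference)
hist_probs = rng.random(32)   # toy density values for historical

scores = []
for _ in range(200):
    # Subsample historical to match the AMIP length, then normalise both
    sub = rng.choice(hist_probs, size=amip_probs.size, replace=False)
    p = amip_probs / amip_probs.sum()
    q = sub / sub.sum()
    scores.append(kl(p, q))

avg_kl = np.mean(scores)
```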

Note: KL(p||q) asks how different an estimated distribution q is from the reference distribution p. As AMIP is our reference, we take p to be the AMIP output and q to be the historical output.

An alternative using the closed form for the KL divergence between two Gaussians:
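Since both distributions are fitted Gaussians, the KL divergence has the standard closed form, sketched here with NumPy (toy parameters):

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) ):
    0.5 * [ tr(S1^-1 S0) + (m1-m0)^T S1^-1 (m1-m0) - d + ln(det S1 / det S0) ]"""
    d = mu0.size
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (
        np.trace(inv1 @ cov0)
        + diff @ inv1 @ diff
        - d
        + np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    )

mu = np.zeros(3)
cov = np.eye(3)
print(kl_gaussians(mu, cov, mu, cov))  # 0.0 for identical Gaussians
```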

JS Distance

Wasserstein

To calculate the Wasserstein distance, we ignore the fitted Gaussian and perform the following steps:
(1): Set the 3D points in each experiment to have a uniformly distributed mass
(2): Calculate the cost matrix associated to the position of the 3D points in the two experiments
(3): Use the weights and cost matrix to calculate the (approximate) Wasserstein distance/optimal transport cost.
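When the two point clouds happen to have the same size, uniform-weight optimal transport reduces to an assignment problem, so the steps above can be sketched with SciPy alone (the general unequal-size case would need an OT solver such as the POT library; the point clouds here are toy data):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(5)
x = rng.normal(size=(20, 3))             # toy 3D points, experiment 1
y = rng.normal(1.0, 1.0, size=(20, 3))   # toy 3D points, experiment 2

p = 2
cost = cdist(x, y, metric="euclidean") ** p   # step (2): cost matrix
rows, cols = linear_sum_assignment(cost)      # step (3): optimal matching
# With uniform weights 1/n, the transport cost is the mean matched cost
w_p = cost[rows, cols].mean() ** (1 / p)      # p-Wasserstein distance
print(w_p)
```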

p=1

p=2

p=3

Summarise

Old

Energy Distance

We use the dcor package, which estimates the energy distance using Hoeffding's unbiased U-statistics (see the dcor documentation for more information).
However, the value returned is the squared distance, which is not a metric, so we take its square root.

My own implementation, using empirical estimators of the expectations obtained by averaging pairwise distances:
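A minimal sketch of such an estimator, using the (biased) V-statistic form rather than dcor's unbiased U-statistics (toy samples):

```python
import numpy as np

def energy_distance_sq(x, y):
    """Squared energy distance via empirical estimators:
    2*E||X-Y|| - E||X-X'|| - E||Y-Y'||, with each expectation
    estimated by the mean of pairwise Euclidean distances."""
    def mean_pairwise(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diff, axis=-1).mean()
    return 2 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)

rng = np.random.default_rng(6)
x = rng.normal(size=(21, 3))
y = rng.normal(0.5, 1.0, size=(32, 3))

e = np.sqrt(energy_distance_sq(x, y))   # square root to obtain a metric
print(e)
```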